ABSTRACT

We propose a natural way to generalize relative transfer functions
(RTFs) to more than one source. We first prove that such a
generalization is not possible using a single multichannel
spectro-temporal observation, regardless of the number of microphones.
We then introduce a new transform for multichannel multi-frame
spectrograms, i.e., containing several channels and time frames in each
time-frequency bin. This transform allows a natural generalization
which satisfies the three key properties of RTFs, namely, they can be
directly estimated from observed signals, they capture spatial
properties of the sources and they do not depend on emitted signals.
Through simulated experiments, we show how this new method can localize
multiple simultaneously active sound sources using short
spectro-temporal windows, without relying on source separation.

Index Terms: Relative Transfer Function, Grassmannian manifolds,
Plücker Embedding, Multiple sound sources localization

1. INTRODUCTION 

When sound propagates from an emitter to a receiver in a natural
environment, objects along its path (e.g., a human or robot head,
walls...) lead to reflections and reverberation. This is commonly
modeled as a linear filtering and described by the convolution of the
emitted signal with a so-called room impulse response (RIR). For a
given room, the latter only depends on the source's spatial properties
(position, orientation, directivity, diffuseness, etc.) and not on the
emitted signal. The frequency domain counterparts of RIRs are acoustic
transfer functions (ATFs). Knowledge of the ATFs involved in an
acoustic setup is useful in many audio signal processing applications,
e.g., blind source separation [1], beamforming [2], sound source
localization [3, 4, 5], acoustic echo cancellation [6].

Most existing methods to estimate ATFs rely on the synchronized emitted
and received signals. However, the emitted signals are often not
available, rendering the estimation of ATFs impossible without
additional restrictive assumptions. For this reason, relative transfer
functions are often considered [7]. These also capture source spatial
properties and do not depend on the emitted signal, with the advantage
that they can be reliably and robustly estimated directly from an
observed multichannel signal [8], [9]. They are defined as a normalized
version of ATFs, i.e., the ATF at a given microphone is divided by a
linear combination of the ATFs to other microphones, e.g., the ATF of a
reference microphone. In the case of M = 2 microphones, the
log-magnitude and phase of RTFs are referred to as interaural level and
phase differences, respectively, in the binaural hearing literature
[10], [11]. Recently, supervised sound source localization methods
making use of a training set of interaural cues [5] or of RTFs [?] have
been proposed.

In this paper, we theoretically investigate the possibility of
generalizing RTFs to more than one source. Such generalizations should
preserve the three key properties of RTFs, namely, they can be directly
estimated from observed signals, they capture spatial properties of the
sources and they do not depend on the emitted signals. We first state
and prove a theorem showing that such a generalization is not possible
if a single multichannel spectro-temporal observation is used. We then
consider the case of multiple time observations, and propose a new
transformation for multichannel, multi-frame spectrograms, i.e.,
containing several multichannel time frames in each time-frequency bin.
This transformation builds on the Plücker embedding method for
Grassmannian manifolds. We show that it yields a natural generalization
of RTFs to multiple sources, when there are fewer sources than
microphones. Through simulated experiments, we show how this method
could be applied to the localization of multiple simultaneously active
sound sources using short spectro-temporal windows, without having to
separate them.

2. GENERALIZING RTFS 

2.1. Single-source case and RTF properties 

Let us consider a sound source emitting the spectrogram
$\{s_{ft}\}_{f=1,t=1}^{F,T} \subset \mathbb{C}$ recorded by an
M-microphone array, where F and T are the number of frequency bands and
the number of time frames, respectively. Under noise-free, finite
convolutive filtering assumptions and for long enough time frames, the
multichannel observation
$\mathbf{x}_{ft} = [x_{ft,1}, \dots, x_{ft,M}]^\top \in \mathbb{C}^M$
received by the M microphones at frequency-time $(f, t)$ is given by

$$ \mathbf{x}_{ft} = \mathbf{a}_f s_{ft} \tag{1} $$

where $\mathbf{a}_f = [a_{f,1}, \dots, a_{f,M}]^\top \in \mathbb{C}^M$
comprises the acoustic transfer functions from the source to the M
microphones at frequency $f$. For a given microphone setup in a given
room, $\mathbf{a}_f$ solely depends on the source's spatial properties.
Therefore, (1) nicely decomposes the recorded signal into a component
$\mathbf{a}_f$ that only captures spatial properties and a component
$s_{ft}$ that only captures the source content at $(f, t)$.

If the emitted signal $s_{ft}$ is unknown, unambiguously recovering
$\mathbf{a}_f$ from observation $\mathbf{x}_{ft}$ is impossible without
further assumptions. However, the specific structure of Eq. (1) offers
an attractive way to circumvent this. Let $\nu$ be a normalizing
function, which divides an input vector by a linear combination of its
entries, e.g., the first entry. It is then easy to check that
$\nu(\mathbf{x}_{ft}) = \nu(\mathbf{a}_f)$ for all
$\mathbf{x}_{ft} \in \mathcal{X}$, where
$\mathcal{X} \subset \mathbb{C}^M$ is the nonzero locus of the linear
combination. In other words, the signal term cancels out and
$\mathbf{r}_f = \nu(\mathbf{x}_{ft})$, when defined, captures only the
spatial properties of the source. In the signal processing literature,
$\mathbf{r}_f$ is referred to as a relative transfer function (RTF)
[7]. In summary, relative transfer functions possess three key
desirable properties:

(I) They can be directly estimated from observed signals 

(II) They capture spatial properties of the sound source 

(III) They do not depend on the emitted signal 

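Mathematically, these three properties are verified if and only if
there exists a non-constant function $g: \mathcal{X} \to \Omega$ and a
function $h$ such that (1) $\Rightarrow$
$g(\mathbf{x}_{ft}) = h(\mathbf{a}_f)$ for all
$\mathbf{x}_{ft} \in \mathcal{X}$, where $\Omega$ is an arbitrary set
and $\mathcal{X} \subseteq \mathbb{C}^M \setminus \{0\}$.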
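
The cancellation of the signal term can be checked numerically. The following is a minimal sketch, assuming that the normalizing function divides by the first entry and using randomly drawn complex values as stand-ins for an ATF vector and for emitted signals. The names `nu`, `rand_c` and `max_err` are illustrative and do not come from the original work.

```python
import random

def nu(v):
    """Normalizing function: divide a vector by its first entry (one possible choice)."""
    if v[0] == 0:
        raise ZeroDivisionError("RTF undefined: reference entry is zero")
    return [z / v[0] for z in v]

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

rng = random.Random(0)
M = 4
a_f = [rand_c(rng) for _ in range(M)]   # hypothetical ATF vector at one frequency
r_f = nu(a_f)                           # RTF computed from the (normally unknown) ATFs

max_err = 0.0
for _ in range(5):                      # five frames with different emitted signals
    s_ft = rand_c(rng)
    x_ft = [a * s_ft for a in a_f]      # Eq. (1): x_ft = a_f * s_ft
    r_obs = nu(x_ft)                    # RTF estimated from the observation alone
    max_err = max(max_err, max(abs(p - q) for p, q in zip(r_obs, r_f)))

print(max_err)  # deviation between observed and true RTF, at rounding-error level
```

Whatever the emitted value `s_ft`, the normalized observation matches the normalized ATF vector, which is what properties (I) to (III) require.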

2.2. Instantaneous multiple-source case 

In the case of K sound sources emitting spectrograms
$\{s_{ft,k}\}_{f=1,t=1}^{F,T}$ for $k = 1 \dots K$, model (1) becomes:

$$ \mathbf{x}_{ft} = \sum_{k=1}^{K} \mathbf{a}_{f,k} s_{ft,k} = \mathbf{A}_{f,K} \mathbf{s}_{ft} \tag{2} $$

where $\mathbf{s}_{ft} = [s_{ft,1}, \dots, s_{ft,K}]^\top \in \mathbb{C}^K$
is the vector of emitted signals and
$\mathbf{A}_{f,K} = [\mathbf{a}_{f,1}, \dots, \mathbf{a}_{f,K}] \in \mathbb{C}^{M \times K}$
comprises the K acoustic transfer functions capturing the sources'
spatial properties. An interesting question is: can we generalize
relative transfer functions to more than one source, while preserving
properties (I), (II) and (III)? In other words, is there a non-constant
function $g: \mathcal{X} \to \Omega$ and a function $h$ such that
$g(\mathbf{x}_{ft}) = h(\mathbf{A}_{f,K})$ for all
$\mathbf{x}_{ft} \in \mathcal{X}$? In this section, we prove that the
answer is "no" through the following theorem:

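Theorem 1 Let $\mathcal{X}$ be a subset of
$\mathbb{C}^M \setminus \{0\}$, $\Omega$ an arbitrary set,
$g: \mathcal{X} \to \Omega$ and $h: \mathbb{C}^{M \times K} \to \Omega$
two functions and $K > 1$. If for all
$\mathbf{A} \in \mathbb{C}^{M \times K}$ and for all
$\mathbf{s} \in \mathbb{C}^K$ with $\mathbf{A}\mathbf{s} \in \mathcal{X}$
we have $g(\mathbf{A}\mathbf{s}) = h(\mathbf{A})$, then $g$ is constant.

In other words, the only possible multiple-source instantaneous
generalizations of RTFs are constant, which violates property (II).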

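Proof of Theorem 1

Let $g: \mathcal{X} \to \Omega$ and
$h: \mathbb{C}^{M \times K} \to \Omega$ be two functions such that for
all $\mathbf{A} \in \mathbb{C}^{M \times K}$ and for all
$\mathbf{s} \in \mathbb{C}^K$ with $\mathbf{A}\mathbf{s} \in \mathcal{X}$
we have $g(\mathbf{A}\mathbf{s}) = h(\mathbf{A})$.

• Case $K \geq M$: Let $\mathbf{A} \in \mathbb{C}^{M \times K}$ be a
fixed matrix with M linearly independent columns. Then, for all
$\mathbf{x} \in \mathcal{X}$, we have $\mathbf{x} = \mathbf{A}\mathbf{s}$
with $\mathbf{s} = \mathbf{A}^H (\mathbf{A}\mathbf{A}^H)^{-1} \mathbf{x}$,
where $^H$ denotes the conjugate transpose. By definition of $g$ and
$h$, we thus have
$g(\mathbf{x}) = g(\mathbf{A}\mathbf{s}) = h(\mathbf{A})$ for all
$\mathbf{x} \in \mathcal{X}$. $h(\mathbf{A})$ does not depend on
$\mathbf{x}$. Therefore, $g$ is constant.

• Case $K < M$: Let $\mathbf{A} \in \mathbb{C}^{M \times K}$ be a fixed
matrix with K linearly independent columns. Let $E_{\mathbf{A}}$ be the
column space of $\mathbf{A}$, i.e., the K-dimensional vector subspace
of $\mathbb{C}^M$ defined by
$E_{\mathbf{A}} = \{\mathbf{A}\mathbf{s};\ \mathbf{s} \in \mathbb{C}^K\}$.
We now prove that $g(\mathbf{x}) = h(\mathbf{A})$ for all
$\mathbf{x} \in \mathcal{X}$:

- If $\mathbf{x} \in E_{\mathbf{A}}$ then by definition of
$E_{\mathbf{A}}$ there is $\mathbf{s}$ such that
$\mathbf{x} = \mathbf{A}\mathbf{s}$, and thus
$g(\mathbf{x}) = g(\mathbf{A}\mathbf{s}) = h(\mathbf{A})$.

- If $\mathbf{x} \notin E_{\mathbf{A}}$, let
$\mathbf{x}' \in E_{\mathbf{A}} \cap \mathcal{X}$. Then $\mathbf{x}$
and $\mathbf{x}'$ are linearly independent. Let
$\mathbf{A}' = [\mathbf{x}, \mathbf{x}', \mathbf{a}'_3, \dots, \mathbf{a}'_K] \in \mathbb{C}^{M \times K}$
have K linearly independent columns (note that this is only possible
because $K > 1$). Let $\mathbf{s} = [1, 0, \dots, 0]^\top$ and
$\mathbf{s}' = [0, 1, 0, \dots, 0]^\top$, so that
$\mathbf{x} = \mathbf{A}'\mathbf{s}$ and
$\mathbf{x}' = \mathbf{A}'\mathbf{s}'$. By definition of $g$ and $h$,
we have
$g(\mathbf{x}) = g(\mathbf{A}'\mathbf{s}) = h(\mathbf{A}') = g(\mathbf{A}'\mathbf{s}') = g(\mathbf{x}')$.
Since $\mathbf{x}' \in E_{\mathbf{A}}$, we have
$g(\mathbf{x}') = h(\mathbf{A})$ and thus
$g(\mathbf{x}) = h(\mathbf{A})$.

Thus, $g(\mathbf{x}) = h(\mathbf{A})$ for all
$\mathbf{x} \in \mathcal{X}$, and $h(\mathbf{A})$ does not depend on
$\mathbf{x}$. Therefore, $g$ is constant. ■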
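
Theorem 1 can be illustrated, though of course not proven, with a toy example. The sketch below assumes K = 2 sources, M = 3 microphones and randomly drawn stand-in ATFs, and takes the first-entry normalization as a candidate for g. All names are illustrative. For a fixed matrix A, two different source vectors give different values of g(As), so no function h(A) can equal both.

```python
import random

def nu(v):
    """Candidate g: divide a vector by its first entry."""
    return [z / v[0] for z in v]

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

def mix(A, s):
    """Instantaneous mixture x = A s of Eq. (2)."""
    return [sum(a_mk * s_k for a_mk, s_k in zip(row, s)) for row in A]

rng = random.Random(1)
M, K = 3, 2
A = [[rand_c(rng) for _ in range(K)] for _ in range(M)]  # hypothetical M x K ATF matrix

g1 = nu(mix(A, [1, 0]))   # only source 1 active
g2 = nu(mix(A, [0, 1]))   # only source 2 active

# Largest entrywise difference between the two normalized observations
gap = max(abs(p - q) for p, q in zip(g1, g2))
print(gap)  # clearly nonzero: the single-frame normalization still depends on s
```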

2.3. Multiple-frame, multiple-source case 

In this section we overcome the non-existence of an instantaneous
generalization of RTFs by proposing a multi-frame generalization. More
precisely, we consider the case where K observations, rather than one,
are available along the time axis. Using the following notations:

$$ \mathbf{X}_{ft,K} = [\mathbf{x}_{ft}, \dots, \mathbf{x}_{f,t+K-1}] \in \mathbb{C}^{M \times K}, \tag{3} $$

$$ \mathbf{S}_{ft,K} = [\mathbf{s}_{ft}, \dots, \mathbf{s}_{f,t+K-1}] \in \mathbb{C}^{K \times K}, \tag{4} $$

we obtain a multiframe version of (2) for the time segment
$[t \dots t+K-1]$:

$$ \mathbf{X}_{ft,K} = \mathbf{A}_{f,K} \mathbf{S}_{ft,K}. \tag{5} $$

We will refer to $\{\mathbf{X}_{ft,K}\}_{f=1,t=1}^{F,T}$ as a
multichannel, K-frame spectrogram. Each time-frequency bin contains an
$M \times K$ complex matrix. The question then becomes: is there a
non-constant function $g$ and a function $h$ such that
$g(\mathbf{X}_{ft,K}) = h(\mathbf{A}_{f,K})$ for all
$\mathbf{A}_{f,K} \in \mathbb{C}^{M \times K}$ and
$\mathbf{S}_{ft,K} \in \mathbb{C}^{K \times K}$? From now and until the
end of this paper, we will assume that the number of sources is
strictly lower than the number of microphones, i.e. $K < M$. Under this
assumption, an interesting candidate solution is
$g = h = \mathrm{span}$, where
$\mathrm{span}: \mathbb{C}^{M \times K} \to \mathrm{Gr}(K, \mathbb{C}^M)$
is the function associating a matrix to its column space.
$\mathrm{Gr}(K, \mathbb{C}^M)$ is called a Grassmannian manifold:
elements of this set are K-dimensional linear subspaces of
$\mathbb{C}^M$ [?]. Assuming that the square matrix $\mathbf{S}_{ft,K}$
has linearly independent columns (this assumption is further discussed
in Section 2.6), it acts as a change of basis from the column space of
$\mathbf{A}_{f,K}$ to the column space of $\mathbf{X}_{ft,K}$ in
equation (5). Therefore,
$\mathrm{span}(\mathbf{X}_{ft,K}) = \mathrm{span}(\mathbf{A}_{f,K})$
does not depend on $\mathbf{S}_{ft,K}$, and span possesses the desired
properties to generalize RTFs.
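
The subspace equality can be checked numerically by comparing ranks. The sketch below assumes M = 4, K = 2 and random complex stand-ins for the transfer functions and signals. The helper `rank` is a hypothetical tolerance-based Gaussian elimination written for this illustration only. If the columns of X lie in the column space of A, placing both matrices side by side should not raise the rank above K, whereas pairing X with an unrelated matrix B should.

```python
import random

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def rank(m, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    rows, cols = len(a), len(a[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        p = max(range(r, rows), key=lambda i: abs(a[i][c]))
        if abs(a[p][c]) <= tol:
            continue                      # no usable pivot in this column
        a[r], a[p] = a[p], a[r]
        for i in range(r + 1, rows):
            f = a[i][c] / a[r][c]
            for k in range(c, cols):
                a[i][k] -= f * a[r][k]
        r += 1
    return r

rng = random.Random(2)
M, K = 4, 2
A = [[rand_c(rng) for _ in range(K)] for _ in range(M)]  # hypothetical ATF matrix
S = [[rand_c(rng) for _ in range(K)] for _ in range(K)]  # signals over K frames
B = [[rand_c(rng) for _ in range(K)] for _ in range(M)]  # unrelated source positions
X = matmul(A, S)                                         # Eq. (5)

rank_A, rank_X = rank(A), rank(X)
rank_AX = rank([ra + rx for ra, rx in zip(A, X)])  # [A | X]: same subspace
rank_BX = rank([rb + rx for rb, rx in zip(B, X)])  # [B | X]: different subspaces
print(rank_A, rank_X, rank_AX, rank_BX)
```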

However, the output values of span are not vectors but vector
subspaces. These cannot be manipulated numerically. We thus need a way
to map the Grassmannian manifold $\mathrm{Gr}(K, \mathbb{C}^M)$ to a
numerical space. This is possible using a method known as Plücker
embedding [?]. The method was first introduced in the case K = 2 and
M = 4 by Julius Plücker in 1865, and later generalized to any K and M
values by Hermann Grassmann. Building on this, we propose a new
transform for multichannel, multi-frame spectrograms. This transform
applied to equation (5) will yield an equation of the form (1),
allowing a generalization of RTFs to multiple sources. We shall name it
the Plücker spectrogram transform after the work of Julius Plücker.


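2.4. The Plücker spectrogram transform

Let $\{\mathbf{X}_{ft,K}\}_{f=1,t=1}^{F,T}$ be an M-channel K-frame
spectrogram. We denote by $\mathbf{X}_{ft,K|i_1,\dots,i_K}$ the
$K \times K$ matrix formed by the K rows of $\mathbf{X}_{ft,K}$ with
indexes $i_1, i_2, \dots, i_K$. Let $\sigma(1), \dots, \sigma(L)$ be
the lexicographically-ordered list of cardinal-K sublists of
$\{1, \dots, M\}$ with $L = \binom{M}{K}$. We define the Plücker
spectrogram transform of order K as follows:

$$ p_K(\mathbf{X}_{ft,K}) = \begin{pmatrix} \det(\mathbf{X}_{ft,K|\sigma(1)}) \\ \det(\mathbf{X}_{ft,K|\sigma(2)}) \\ \vdots \\ \det(\mathbf{X}_{ft,K|\sigma(L)}) \end{pmatrix} \in \mathbb{C}^L. \tag{6} $$

This transform applied to (5) yields the following remarkable identity:

$$ p_K(\mathbf{X}_{ft,K}) = p_K(\mathbf{A}_{f,K}) \det(\mathbf{S}_{ft,K}). \tag{7} $$

This follows from the determinant property
$\det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A}) \det(\mathbf{B})$ for
square matrices $\mathbf{A}$ and $\mathbf{B}$ of equal sizes, applied
to each row selection
$\mathbf{X}_{ft,K|\sigma(l)} = \mathbf{A}_{f,K|\sigma(l)} \mathbf{S}_{ft,K}$.
Interestingly, (7) has the same form as equation (1). In other words,
the Plücker spectrogram transform changes an M-microphone observation
of K sources into an $\binom{M}{K}$-microphone observation of a single
(compound) source. As a consequence, we have:

$$ \mathbf{r}_{f,K} = \nu(p_K(\mathbf{X}_{ft,K})) = \nu(p_K(\mathbf{A}_{f,K})). \tag{8} $$

Therefore, $\mathbf{r}_{f,K}$ is a suitable generalization of RTFs to K
sources and M microphones ($K < M$) using multiframe spectrograms.
Namely, it verifies properties (I), (II) and (III), and for K = 1, the
RTF definition given in Section 2.1 is exactly recovered.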
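
The transform and the two identities above can be sketched in a few lines. This is only an illustration under stated assumptions: random complex matrices stand in for the ATFs and emitted signals, the normalizing function divides by the first entry, and the helper names (`det`, `plucker`, `nu`, `err7`, `err8`) are hypothetical. `itertools.combinations` emits row subsets in lexicographic order, matching the ordering used in the definition.

```python
import random
from itertools import combinations

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) == 0:
            return 0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d                        # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def plucker(X, K):
    """All K x K minors of an M x K matrix, row subsets in lexicographic order."""
    return [det([X[i] for i in rows]) for rows in combinations(range(len(X)), K)]

def nu(v):
    """Normalizing function: divide by the first entry (one possible choice)."""
    return [z / v[0] for z in v]

rng = random.Random(3)
M, K = 4, 2
A = [[rand_c(rng) for _ in range(K)] for _ in range(M)]    # hypothetical ATFs
S1 = [[rand_c(rng) for _ in range(K)] for _ in range(K)]   # signals, segment 1
S2 = [[rand_c(rng) for _ in range(K)] for _ in range(K)]   # signals, segment 2
X1, X2 = matmul(A, S1), matmul(A, S2)                      # Eq. (5)

pA, pX1 = plucker(A, K), plucker(X1, K)
d1 = det(S1)
err7 = max(abs(x - a * d1) for x, a in zip(pX1, pA))       # identity (7)

g_A, g1, g2 = nu(pA), nu(pX1), nu(plucker(X2, K))
err8 = max(max(abs(p - q) for p, q in zip(g, g_A)) for g in (g1, g2))  # identity (8)
print(len(pA), err7, err8)
```

Both segments yield the same normalized vector even though their emitted signals differ, which is the sense in which the generalized RTF does not depend on the sources' content.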


2.5. Relation to subspace methods 

The proposed approach shares a lot of similarities with the so-called
subspace methods for sound source localization. A well-known example is
the method MUSIC, which stands for Multiple Signal Classification [14],
[15]. MUSIC starts by computing the covariance matrix of a multichannel
signal in a given frequency band. An eigenvalue decomposition of this
matrix is then performed, which allows identifying the signal subspace,
spanned by the principal eigenvectors, and the orthogonal noise
subspace, spanned by the remaining eigenvectors. As shown in Section
2.3, the signal subspace corresponds to the space spanned by the ATF,
or equivalently the RTF vectors associated to the emitting sources,
i.e., $\mathrm{span}(\mathbf{A}_{f,K})$. In contrast, RTF vectors are
orthogonal to the noise subspace. Therefore, sound source directions
are those whose associated RTF vectors have minimal projections onto
the noise subspace. They are usually estimated by finding the smallest
projections of a predefined set of RTF vectors.

Alternatively, in equation (8), we introduce a new vector
$\mathbf{r}_{f,K}$ which uniquely characterizes the signal subspace
$\mathrm{span}(\mathbf{A}_{f,K})$, using a minimal number of
observations. This vector can thus be directly mapped to the spatial
properties of all sources, provided that the associated mapping
function is known. This mapping may either be directly obtained from a
sound propagation model or learned from a predefined set of RTF
vectors, as demonstrated in Section 3. An intrinsic difference between
this approach and MUSIC is that it does not require the estimation and
decomposition of covariance matrices. On the other hand, it requires a
mapping from generalized RTFs to multiple-source spatial
characteristics, while MUSIC only requires single-source mappings.


2.6. Conditions of applicability and properties 

Assuming that the normalizing function $\nu$ divides a vector by, e.g.,
its first entry, (8) is only valid if
$\det(\mathbf{X}_{ft,K|\sigma(1)}) \neq 0$. Since identity (7) reads
$p_K(\mathbf{X}_{ft,K}) = p_K(\mathbf{A}_{f,K}) \det(\mathbf{S}_{ft,K})$,
it follows from (6) and properties of the determinant that such
singularity only occurs in the following situations:

• If one or more sources are completely silent in all K time frames
$(t \dots t+K-1)$ at frequency $f$.

• If two or more sources are perfectly correlated over the segment,
i.e., their absolute normalized cross-correlation is 1.

• If two or more sources have similar spatial properties, i.e.
$\mathbf{a}_{f,k} = \alpha \mathbf{a}_{f,l}$ for some
$\alpha \in \mathbb{C}$, $k \neq l$. This may occur if, e.g., they have
identical directions in the free-field case.

• If the K transfer functions and emitted signals are such that
observations are linearly dependent, by coincidence.

Let us define audio sources as objects emitting distinguishable sounds
from distinguishable locations. Then, the first three cases may be
interpreted as a violation of the assumption that there are K sources.
The fourth case is harder to interpret, but it has a zero probability
of occurrence assuming that distinguishable transfer functions and
signals are mutually statistically independent. In other words, the
proposed generalization of RTF is sound if the assumed number of
sources K is correct. If the actual number of sources P at $(f, t)$ is
less than K then $p_K(\mathbf{X}_{ft,K}) = \mathbf{0}$. If $P > K$, the
desirable properties are no longer preserved. A straightforward way to
determine P is to note that:

$$ P = \mathrm{rank}(\mathbf{X}_{ft,K}) \quad \text{for } K > P. \tag{9} $$

If $P < M$, P can thus be deduced by successively calculating
$\mathrm{rank}(\mathbf{X}_{ft,K})$ for $K = 1 \dots M-1$.
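
The successive rank computation can be sketched as follows, in the noise-free setting only. The function `estimate_num_sources`, the tolerance-based `rank` helper and the random stand-in data are assumptions made for this illustration. With noisy observations the rank threshold would have to be chosen with care, which the text leaves open.

```python
import random

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

def rank(m, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    rows, cols = len(a), len(a[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        p = max(range(r, rows), key=lambda i: abs(a[i][c]))
        if abs(a[p][c]) <= tol:
            continue
        a[r], a[p] = a[p], a[r]
        for i in range(r + 1, rows):
            f = a[i][c] / a[r][c]
            for k in range(c, cols):
                a[i][k] -= f * a[r][k]
        r += 1
    return r

def estimate_num_sources(frames, M):
    """frames: at least M-1 consecutive observation vectors x_ft (each of length M).
    Returns the estimated number of sources P, assuming P < M and no noise."""
    r = 0
    for K in range(1, M):
        X = [[frames[k][m] for k in range(K)] for m in range(M)]  # M x K matrix
        r = rank(X)
        if r < K:        # rank stopped growing, so K > P and Eq. (9) gives P = r
            return r
    return r             # rank never dropped: P = M - 1 under the P < M assumption

def simulate_frames(P, n_frames, M, rng):
    """Noise-free observations of P hypothetical sources over n_frames frames."""
    A = [[rand_c(rng) for _ in range(P)] for _ in range(M)]
    frames = []
    for _ in range(n_frames):
        s = [rand_c(rng) for _ in range(P)]
        frames.append([sum(A[m][k] * s[k] for k in range(P)) for m in range(M)])
    return frames

rng = random.Random(5)
M = 4
estimates = [estimate_num_sources(simulate_frames(P, M - 1, M, rng), M)
             for P in (1, 2, 3)]
print(estimates)
```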

3. SIMULATED EXPERIMENTS 

We test the potential of the proposed generalization of RTF for
multiple sound-source localization (SSL). In what follows, spectrograms
are computed on signals sampled at 8 kHz using 32 ms sliding windows
with 50% overlap. This results in F = 128 positive frequencies and
T = 64 time frames per second of signal. We use a dataset of
head-related transfer functions (HRTFs) for the humanoid robot NAO.
These HRTFs are simulated using a 3D model of the head in an anechoic
environment and the boundary element method, as done in [?].
Corresponding impulse responses have a maximal length of 10 ms. The
subset $\mathcal{H}$ used contains N = 21 HRTFs
$\{\mathbf{a}_f(\theta_n)\}_{f=1,n=1}^{F,N} \subset \mathbb{C}^M$ for
the M = 4 microphones placed on the head. Here
$\Theta = \{\theta_1, \dots, \theta_N\}$ is a set of source directions
with azimuth and elevations randomly picked in [−180°, 180°] and
[−10°, 10°] respectively. From this dataset, the following generalized
RTF (GRTF) training sets are generated, for K = 1 to 3:

$$ \mathcal{R}_K = \{ \nu(p_K([\mathbf{a}_f(\theta_1), \dots, \mathbf{a}_f(\theta_K)]));\ \theta_1 < \dots < \theta_K \in \Theta,\ f = 1 \dots F \} $$

where the cardinality of $\mathcal{R}_K$ is $F \binom{N}{K}$. We then
simulate all possible M-microphone mixtures of one to three white-noise
sources coming from distinct directions in $\Theta$, by convolving
random signals of one second duration with the HRTFs in $\mathcal{H}$.
The minimum distance between distinct sources is 1° in azimuth and 3°
in elevation. These mixtures are perturbed by additive Gaussian noise
with 10 dB or 50 dB signal-to-noise ratios (SNRs). The Plücker
spectrogram transform of order K (6) is then applied to all individual
K-frame time segments of all these mixtures, where K is the number of
sources, assumed known. The F GRTFs associated with the F frequency
bins at each segment are concatenated and compared to those of the
corresponding training set $\mathcal{R}_K$, in terms of Euclidean
distance. The set of K directions minimizing this distance gives the
estimated sound source directions. For K = 1, 2 and 3, this
respectively corresponds to approximately 1,300, 26,000 and 250,000
localization tasks using time segments of length 32 ms, 48 ms and
64 ms. The mean computational times per source per second of signal
were respectively 81 ms, 87 ms and 436 ms using our Matlab
implementation on a conventional PC. Mean absolute azimuth localization
errors obtained with this procedure are summarized in Table 1 (GRTF).

Table 1. Mean absolute azimuth localization error using generalized
RTFs on mixtures of 1 to 3 sources, with 10 or 50 dB signal-to-noise
ratios.

                        Number of sources
                        1         2         3
GRTF (SNR=50 dB)        0.04°     0.68°     1.45°
GRTF (SNR=10 dB)        10.9°     17.5°     27.4°
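
The nearest-neighbour comparison against the training set can be sketched on a toy problem. Everything specific here is an assumption made for illustration: random complex vectors stand in for the NAO HRTFs, the sizes (N = 6 directions, F = 8 bins) are much smaller than in the experiments, no noise is added, and the helper names are hypothetical. It is not the Matlab implementation used to produce Table 1.

```python
import random
from itertools import combinations

def rand_c(rng):
    """Draw a random complex number (stand-in for real acoustic data)."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) == 0:
            return 0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def grtf(mats, K):
    """Concatenate over frequency bins the first-entry-normalized K x K minors."""
    out = []
    for X in mats:
        p = [det([X[i] for i in rows]) for rows in combinations(range(len(X)), K)]
        out.extend(z / p[0] for z in p)
    return out

rng = random.Random(4)
M, K, F, N = 4, 2, 8, 6
# Hypothetical stand-in for the HRTF set: hrtf[n][f] is an M-vector
hrtf = [[[rand_c(rng) for _ in range(M)] for _ in range(F)] for _ in range(N)]

def stack(dirs, f):
    """M x K matrix whose columns are the transfer vectors of the given directions."""
    return [[hrtf[n][f][m] for n in dirs] for m in range(M)]

# Training set: one concatenated GRTF per K-subset of directions
train = {dirs: grtf([stack(dirs, f) for f in range(F)], K)
         for dirs in combinations(range(N), K)}

# One simulated K-frame segment: sources at directions 1 and 4, random signals
true_dirs = (1, 4)
obs = [matmul(stack(true_dirs, f),
              [[rand_c(rng) for _ in range(K)] for _ in range(K)])
       for f in range(F)]
g_obs = grtf(obs, K)

def dist2(u, v):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(p - q) ** 2 for p, q in zip(u, v))

est_dirs = min(train, key=lambda d: dist2(train[d], g_obs))
print(est_dirs)
```

The lookup table grows as the number of K-subsets of directions, which is one reason the conclusion considers learning the mapping instead of enumerating it.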

The results confirm that the proposed generalization of RTF captures
spatial properties of sources under low noise level (50 dB SNR).
However, performance is severely degraded for higher noise levels
(10 dB SNR). While these results are only preliminary, they reveal two
intrinsic benefits of the proposed approach. First, it can localize K
simultaneous sound sources using only K spectrogram time frames. For
K = 3 and 50 dB SNR, 91% of the 250,000 individual sources were
perfectly localized using GRTFs on 64 ms segments. This is impossible
using methods such as MUSIC [?], where at least M and typically more
time frames are required to reliably estimate spatial covariance
matrices. Second, the K sound sources are jointly localized without
using source separation, even though their spectra are strongly
overlapping (white noise). This makes the method intrinsically
efficient computationally, and contrasts with many existing multiple
sound source localization methods, which rely on source separation [?].
These two features put forward GRTFs as a promising tool to efficiently
localize multiple sound sources using short time windows. This ability
may turn out to be critical, e.g., in realistic human-robot interaction
scenarios where sound sources may be fast moving and computational
resources are limited.





4. CONCLUSION 

We proposed a natural way of generalizing relative transfer functions
to K sources using K spectro-temporal observations, where K is lower
than the number of microphones. To the best of the authors' knowledge,
this is the first study of this kind in signal processing. This work is
mostly preliminary and theoretical. In the future, we plan an in-depth
theoretical and empirical study of the noisy case, and an extension to
natural sounds with sparse spectrograms such as speech. Moreover,
several leads will be investigated to improve robustness to noise,
e.g., estimating the number of sources, combining Plücker transforms of
different orders and weighting time-frequency observations. Finally,
the possibility of learning the mapping function from GRTFs to source
directions will be investigated, following [?]. This would bypass the
need for a comprehensive training set containing all possible
combinations of source positions.